Chapter 10 Structured Corpus

There are a lot of pre-collected corpora available for linguistic studies. This chapter will demonstrate how you can load existing corpora in R and perform basic corpus analysis with these data.

10.1 NCCU Spoken Mandarin

In this demonstration, I will use a dataset from the Taiwan Mandarin Corpus. This dataset, collected by Prof. Kawai Chui at National Cheng-Chi University, consists of spontaneous face-to-face conversations in Taiwan Mandarin. The transcription conventions are described on the NCCU Corpus Official Website.

In general, the corpus transcripts follow the conventions of the CHILDES format. In computational text analytics, the first step is to examine the structure of the textual data, because that structure determines how the text can be split into units for analysis.

10.2 CHILDES Format

The following is an excerpt from the file demo_data/data-nccu-M001.cha from the NCCU Corpus of Taiwan Mandarin. The conventions of CHILDES transcription include:

  • The lines with header information begin with @
  • The lines with utterances begin with *
  • The indented lines refer to the utterances of the continuing speaker turn
  • Words are separated by spaces

The meanings of the transcription symbols used in the corpus can be found in the documentation of the corpus.
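
The line-type conventions listed above can be sketched in a few lines of R. The lines in cha_lines are made up for illustration (they are not taken from the corpus), and the speaker-ID pattern is an assumption based on the description above.

```r
library(stringr)

# Hypothetical lines imitating the layout described above (not real corpus data)
cha_lines <- c(
  "@Begin",
  "@Languages:\tzh",
  "*F1:\t你 今天 有 課 嗎",
  "\t還是 沒有",
  "*F2:\t有 啊"
)

# Classify each line by its first character
line_type <- ifelse(str_detect(cha_lines, "^@"), "header",
             ifelse(str_detect(cha_lines, "^\\*"), "utterance",
             ifelse(str_detect(cha_lines, "^\\s"), "continuation", "other")))

data.frame(line_type, cha_lines)

# Words are space-separated, so one utterance can be split after
# removing the leading speaker ID (assumed to look like "*F1:" plus a tab)
words <- str_split(str_remove(cha_lines[3], "^\\*[^:]+:\\t"), " ")[[1]]
words
```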

10.3 Loading the Corpus

The corpus data is available in our demo_data/corp-NCCU-SPOKEN.tar.gz, which is a compressed archive, i.e., a single gzipped tar file containing all the corpus documents.

We can use readtext::readtext() to load the data.

In this step, we treat all the *.cha files as if they were normal text files (i.e., .txt) and load the entire corpus into a data frame with two columns: doc_id and text. The warning messages only tell you that readtext() does not recognize the .cha extension and therefore reads these files as plain text.
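
A minimal sketch of this loading step is given below. So that it runs without the archive, it first writes two tiny made-up .cha files to a temporary folder (the file names and contents are assumptions for illustration) and reads them with a glob pattern. The commented-out line at the end shows how the archive path mentioned above could be passed instead.

```r
library(readtext)

# Stand-in for the unpacked archive: two tiny hypothetical .cha files
tmp_dir <- file.path(tempdir(), "cha_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("@Begin", "*F1:\t你好"), file.path(tmp_dir, "demo-001.cha"), useBytes = TRUE)
writeLines(c("@Begin", "*F2:\t對 啊"), file.path(tmp_dir, "demo-002.cha"), useBytes = TRUE)

# readtext() accepts a glob pattern; files with an unrecognized
# extension are read as plain text (with a warning)
NCCU_demo <- readtext(file.path(tmp_dir, "*.cha"), encoding = "UTF-8")
names(NCCU_demo)

# With the real data, the archive path could be passed directly:
# NCCU <- readtext("demo_data/corp-NCCU-SPOKEN.tar.gz", encoding = "UTF-8")
```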

10.4 From Text-based to Turn-based DF

Before we tokenize the texts into turns, we first attach every continuation line (an indented line with no speaker ID at its beginning) to the initial utterance of the same speaker turn. We then use unnest_tokens() to transform the text-based DF into a turn-based DF, with one row per line of the transcript.
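
The two steps can be sketched as follows, on a made-up one-document data frame. The sketch assumes that continuation lines start with a tab and that lines are separated by "\n"; the toy text is not taken from the corpus.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical text-based DF with one document
NCCU_demo <- tibble(
  doc_id = "demo-001.cha",
  text = "@Begin\n*F1:\t你 今天 有 課 嗎\n\t還是 沒有\n*F2:\t有 啊"
)

NCCU_turns <- NCCU_demo %>%
  # Attach continuation lines (newline + tab) to the preceding line
  mutate(text = str_replace_all(text, "\n\t", " ")) %>%
  # Split each document at the remaining line breaks: one row per line
  unnest_tokens(
    output = turn,
    input = text,
    token = function(x) str_split(x, "\n"),
    to_lower = FALSE
  )

NCCU_turns
```

Passing a function to token lets us control exactly where the text is split; the function must return a list with one character vector per input document.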

10.5 Metadata vs. Utterances

Lines starting with @ are the headers of the transcript while lines starting with * are the utterances of the conversation. We split our NCCU_turns into:

  • NCCU_turns_meta: a DF with all header lines
  • NCCU_turns_utterance: a DF with all utterance lines
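
A minimal sketch of this split, using a made-up NCCU_turns. Extracting the speaker ID into a separate SPID column is an assumption here, based on the SPID column referred to later in this chapter.

```r
library(dplyr)
library(stringr)

# Hypothetical turn-based DF (not real corpus data)
NCCU_turns <- tibble(
  doc_id = "demo-001.cha",
  turn = c("@Begin", "@id:\tF1|25|female", "*F1:\t你 今天 有 課 嗎", "*F2:\t有 啊")
)

# Header lines start with @
NCCU_turns_meta <- NCCU_turns %>%
  filter(str_detect(turn, "^@"))

# Utterance lines start with *; the speaker ID is assumed to be
# everything between the asterisk and the first colon
NCCU_turns_utterance <- NCCU_turns %>%
  filter(str_detect(turn, "^\\*")) %>%
  mutate(
    SPID = str_replace(turn, "^\\*([^:]+):.*$", "\\1"),
    turn = str_remove(turn, "^\\*[^:]+:\\s*")
  )

NCCU_turns_utterance
```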

10.6 Word-based DF and Frequency List

As all the words have already been separated by spaces, we can easily transform the turn-based DF into a word-based DF using unnest_tokens(). The key is that we supply our own tokenization function through the token = ... argument.
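
A minimal sketch of the word tokenization and the frequency list, on made-up turns. The tokenization function simply splits on whitespace, which assumes that the transcripts are already word-segmented as described.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical utterance turns (not real corpus data)
NCCU_turns_utterance <- tibble(
  doc_id = "demo-001.cha",
  SPID = c("F1", "F2"),
  turn = c("我 覺得 很 好", "我 也 覺得")
)

# One row per word: split each turn on whitespace
NCCU_words <- NCCU_turns_utterance %>%
  unnest_tokens(
    output = word,
    input = turn,
    token = function(x) str_split(str_trim(x), "\\s+"),
    to_lower = FALSE
  )

# Frequency list, most frequent words first
NCCU_word_freq <- NCCU_words %>%
  count(word, sort = TRUE)

NCCU_word_freq
```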

With the word frequencies, we can generate a word cloud to get a quick overview of the word distribution in the NCCU corpus.

10.7 Concordances

If we need to identify the turns that contain a particular linguistic unit, we can apply pattern matching to the turn-based DF and extract the speaker turns with the target pattern.
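
A minimal sketch of such an extraction, on made-up turns. The pattern treats the target as a whole space-delimited word; this is one possible choice, and a plain substring match would also work for many purposes.

```r
library(dplyr)
library(stringr)

# Hypothetical utterance turns (not real corpus data)
NCCU_turns_utterance <- tibble(
  doc_id = "demo-001.cha",
  SPID = c("F1", "F2", "F1"),
  turn = c("我 覺得 他 也 覺得 不錯", "你 今天 有 課 嗎", "對 啊 我 也 覺得")
)

# Target word, matched only when it is not attached to other characters
target <- "(?<!\\S)覺得(?!\\S)"

NCCU_conc <- NCCU_turns_utterance %>%
  filter(str_detect(turn, target)) %>%
  # A turn may contain the target more than once
  mutate(hits = str_count(turn, target))

NCCU_conc
```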


Exercise 10.1 Suppose we are interested in the use of the verb 覺得. After we extract all the speaker turns with the verb 覺得, we may need to know the subjects that often go with the verb.

  1. Please identify the word before the verb for each concordance token as one independent column of the resulting data frame (see below). Please note that one speaker turn may have more than one use of 覺得.

  2. Please create a barplot as shown below to summarize the distribution of the top 10 most frequent words that directly precede 覺得.

  3. Among the top 10 words, you would see “的 覺得” combinations, which are counter-intuitive. Please examine these tokens and explain why.


10.8 Collocations (Bigrams)

Now we extend our analysis beyond single words.

Please recall the ngram_chi() function we have defined and used several times in previous chapters.

We use the self-defined tokenization function and unnest_tokens() to transform the turn-based DF into a bigram-based DF.
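
The sketch below illustrates this step. Since the definition of ngram_chi() is not repeated in this chapter, the sketch uses a hypothetical stand-in, simple_ngrams(), which splits on whitespace and joins adjacent words with an underscore; it is not necessarily identical to ngram_chi(). The turns are made up.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical stand-in for ngram_chi(): space-delimited n-grams,
# returned as a list with one character vector per input text
simple_ngrams <- function(x, n = 2, delimiter = "_") {
  lapply(str_split(str_trim(x), "\\s+"), function(words) {
    words <- words[words != ""]
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) {
      paste(words[i:(i + n - 1)], collapse = delimiter)
    }, character(1))
  })
}

# Hypothetical utterance turns (not real corpus data)
NCCU_turns_utterance <- tibble(
  doc_id = "demo-001.cha",
  turn = c("我 覺得 很 好", "對")
)

# One row per bigram; turns shorter than two words yield no rows
NCCU_bigrams <- NCCU_turns_utterance %>%
  unnest_tokens(
    output = bigram,
    input = turn,
    token = function(x) simple_ngrams(x, n = 2),
    to_lower = FALSE
  )

NCCU_bigrams
```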

To determine the bigrams that are significant, we compute their relevant distributional statistics, including:

  • frequencies
  • dispersion
  • collocation (lexical associations)

To compute the lexical associations, we need to:

  • remove bigrams with para-linguistic tags
  • exclude bigrams of low dispersion
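
The statistics and the two filtering steps above can be sketched as follows. Several things here are assumptions for illustration: the toy bigrams, the pattern used to detect paralinguistic tags (anything containing brackets), the dispersion threshold, and the choice of mutual information (MI) as the association measure. Dispersion is operationalized as the number of documents in which a bigram occurs.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical bigram-based DF (placeholder words, not real corpus data)
NCCU_bigrams <- tibble(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d2"),
  bigram = c("a_b", "a_b", "c_d", "a_b", "c_b", "[laugh]_a")
)

tag_pattern <- "[\\[\\]<>()]"  # assumed shape of paralinguistic tags
min_dispersion <- 2            # assumed threshold

# Frequency and dispersion per bigram, after removing tagged bigrams
bigram_freq <- NCCU_bigrams %>%
  filter(!str_detect(bigram, tag_pattern)) %>%
  group_by(bigram) %>%
  summarize(freq = n(), dispersion = n_distinct(doc_id), .groups = "drop") %>%
  separate(bigram, into = c("w1", "w2"), sep = "_", remove = FALSE)

total <- sum(bigram_freq$freq)

# MI = log2(observed / expected), with expected = f(w1) * f(w2) / N
bigram_assoc <- bigram_freq %>%
  group_by(w1) %>%
  mutate(w1_freq = sum(freq)) %>%
  group_by(w2) %>%
  mutate(w2_freq = sum(freq)) %>%
  ungroup() %>%
  mutate(
    expected = w1_freq * w2_freq / total,
    MI = log2(freq / expected)
  ) %>%
  filter(dispersion >= min_dispersion) %>%
  arrange(desc(MI))

bigram_assoc
```

In this sketch the marginal frequencies are computed before the dispersion filter, so that rare bigrams still contribute to the word counts; filtering first would be an equally defensible design with slightly different values.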

10.9 N-grams (Lexical Bundles)

We can also extend our analysis to n-grams of larger sizes, i.e., the lexical bundles.

10.10 Connecting SPID to Metadata

So far, the previous analyses have not used any information from the headers. In other words, the connection between the utterances and their corresponding speakers’ profiles is not transparent in our current corpus analysis. For sociolinguists, however, the headers of the transcripts can be very informative.

Here I would like to demonstrate how to extract speaker-related information from the headers and link these speaker profiles to our corpus data.

10.11 Corpus Headers

Based on the metadata of each file header, we can extract demographic information related to each speaker, including their ID, age, gender, etc.

In the headers of each transcript, the demographic profile of each speaker is provided in a line starting with @id:\t, and the pieces of information in that line are separated by a pipe sign |. All speakers’ profiles in the corpus follow the same structure.
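
A minimal sketch of this extraction is given below. The header lines and the order of the fields (speaker ID, age, gender) are made up for illustration; the actual lines in the corpus contain more fields, so the column names passed to separate() would need to be adapted. How DOC_SPID is built (file name plus speaker ID) is also an assumption.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical header lines with an assumed field order
NCCU_turns_meta <- tibble(
  doc_id = "demo-001.cha",
  turn = c("@Begin", "@id:\tF1|25|female", "@id:\tF2|31|male")
)

NCCU_meta <- NCCU_turns_meta %>%
  # Keep only the speaker-profile lines
  filter(str_detect(turn, "^@id:\t")) %>%
  mutate(profile = str_remove(turn, "^@id:\t")) %>%
  # Split the profile at each pipe sign
  separate(profile, into = c("SPID", "age", "gender"), sep = "\\|", convert = TRUE) %>%
  # Combine file name and speaker ID into one key per speaker
  mutate(DOC_SPID = str_c(str_remove(doc_id, "\\.cha$"), "_", SPID)) %>%
  select(DOC_SPID, SPID, age, gender)

NCCU_meta
```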

10.12 Sociolinguistic Analyses

Now with NCCU_meta and NCCU_turns_utterance, we can easily connect each utterance to a particular speaker (via SPID in NCCU_turns_utterance and DOC_SPID in NCCU_meta) and therefore study the linguistic variation across speakers of varying sub-groups/communities. The steps are as follows:

  1. We first extract the patterns we are interested in from NCCU_turns_utterance;
  2. We then connect the concordance tokens to their corresponding SPID profiles in NCCU_meta;
  3. We analyze how the patterns vary according to speakers of different profiles.
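
The three steps above can be sketched as follows, with made-up data. The sketch assumes that the join key is built from the file name and the speaker ID; the exact form of the key in NCCU_meta may differ.

```r
library(dplyr)
library(stringr)

# Hypothetical speaker profiles and utterances (not real corpus data)
NCCU_meta <- tibble(
  DOC_SPID = c("demo-001_F1", "demo-001_M1"),
  age = c(25, 31),
  gender = c("female", "male")
)

NCCU_turns_utterance <- tibble(
  doc_id = "demo-001.cha",
  SPID = c("F1", "M1", "F1"),
  turn = c("我 覺得 很 好", "對 啊", "我 也 覺得")
)

pattern_by_gender <- NCCU_turns_utterance %>%
  # Step 1: extract the turns with the pattern of interest
  filter(str_detect(turn, "覺得")) %>%
  # Step 2: link each turn to its speaker profile
  mutate(DOC_SPID = str_c(str_remove(doc_id, "\\.cha$"), "_", SPID)) %>%
  left_join(NCCU_meta, by = "DOC_SPID") %>%
  # Step 3: summarize the pattern by a speaker attribute
  count(gender, name = "n_turns")

pattern_by_gender
```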

10.12.1 Check Bigrams Distribution By Age Groups

10.12.3 Bigram Word clouds by Age


Exercise 10.2 Please create a barplot, showing the top 20 bigrams ranked according to the bigram frequencies for each age group. Also, in the graph please include the information of dispersion for each bigram, using the transparency of the bars. The more transparent, the less dispersed (See below).

Exercise 10.3 Please create a barplot, showing the top 20 bigrams ranked according to the bigram frequencies for male and female speakers separately. Also, in the graph please include the information of dispersion for each bigram, using the transparency of the bars. The more transparent, the less dispersed (See below).